Add USE_ROCM to op_builder/builder.py #1996
Closed
jithunnair-amd wants to merge 319 commits into
Conversation
Hipify revamp changes for apex extensions on ROCm.
Skip the unit tests
Fix reduce_block_into_lanes for multi_tensor_l2norm for ROCm
Conflicts:
	csrc/multi_tensor_apply.cuh
	setup.py
	tests/L0/run_optimizers/test_adagrad.py
	tests/L0/run_optimizers/test_fused_optimizer.py
	tests/L0/run_optimizers/test_lamb.py
Mostly whitespace and formatting issues addressed. The diff with upstream is reduced, and the ROCm changes are clearer.
IFU-2021-01-18
use __launch_bounds__(1024) for multi_tensor_apply, re-enable skipped tests
- fix incorrect use of __shfl_down
- fix warp size assumptions
- update unit tests to exit on failure

(a warp-size-portable reduction sketch follows)
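For illustration, here is a minimal sketch of the portability pattern behind these two commits; it is not the actual apex multi_tensor_apply kernel. The shuffle loop starts at warpSize/2 rather than a hard-coded 16, so it is correct for both 32-wide CUDA warps and 64-wide ROCm wavefronts, and the kernel carries the __launch_bounds__(1024) attribute mentioned above.

```cuda
// Minimal sketch, assuming a single block and blockDim.x a multiple of
// warpSize; this is NOT the apex kernel, just the portability pattern.
__global__ void __launch_bounds__(1024) warp_sum(const float* in, float* out) {
    float v = in[threadIdx.x];
    // warpSize is 32 on CUDA and 64 on ROCm; starting at warpSize/2
    // (instead of a hard-coded 16) makes the reduction correct on both.
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
#ifdef USE_ROCM
        v += __shfl_down(v, offset);                    // HIP has no *_sync shuffles
#else
        v += __shfl_down_sync(0xffffffffu, v, offset);  // full-warp mask
#endif
    }
    if (threadIdx.x % warpSize == 0)
        out[threadIdx.x / warpSize] = v;  // one partial sum per warp
}
```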
Revert "pass all TensorListMetadata as pointer to pinned host memory (#13)
IFU-2020-03-04
…n_check: Make torch version check numeric
…d_lamb: add distributed fused lamb
…clude_dirs: work around hipify not finding headers
* Enable group batch norm (--bnp) on ROCm (only bn_group = 1)
  Enable NHWC group batch norm on a single GPU on ROCm (bn_group = 1). The multi-GPU case (bn_group > 1) will be revisited in the future. The main changes:
  1) Use MIOpen data structures/functions in HIP instead of cuDNN
  2) For the warp-level primitive code, ensure the code operates on a 64-thread-wide warp instead of a 32-thread-wide one
  3) Disable all the bn_group > 1 paths
  Notes: 1) Multi-stream is not tested. 2) Performance has not been optimized.
* Fix bnp hipification
  Avoid calling hipify-perl in setup.py and rely on PyTorch's internal hipification mechanism.
* Make bnp data pointers contiguous
  The contrib group batch norm implementation assumes that all input tensors are contiguous. When non-contiguous tensors are passed to the function, it produces a wrong result. This commit explicitly calls .contiguous() on all input tensors before accessing them.
* Fix HIP lane id in bnp. Fix typo.
* Fix ReLU bitmask for HIP in bnp
  The ReLU bitmask is derived with the __ballot function, which returns a 64-bit value in HIP. This commit fixes the ReLU bitmask storage size and offsets on ROCm. It also fixes the kernel to set the ReLU bitmask when the data is less than or equal to zero (not only strictly less than); not doing so can cause a stability issue. (A portability sketch of the 64-bit ballot bitmask follows this list.)
* Remove multiple-of-64 offset for HIP in bnp
  The multiple-of-64 offset is not necessary.
* Use FP16 intermediate output to determine whether to rectify in bnp
  Group batch norm takes FP16 tensors and produces FP16 output, but all arithmetic is done in FP32, so intermediate outputs are FP32. For the fusion kernels, ReLU inspects the FP32 intermediate output to decide whether to rectify it, and must rectify when the value is less than or equal to zero. The intermediate FP32 output can be very close to zero and become exactly zero when converted to FP16; in that case the output is not rectified when it should be, and since it is not rectified in the forward pass, the gradient is not rectified in the backward pass either. This can cause a stability issue. This patch can hurt group batch norm performance because FP32-FP16 conversion is performed multiple times.
* Disable dispatchX ParallelSums in HIP in bnp
  dispatchX is not required for the bn_group = 1 case.
* Use traditional load/store for HIP in bnp
  The built-in function has a high floating point rounding error, so we replace it with a traditional load/store. Doing so breaks the aligned-pointer property in the load/store functions, so we conservatively use traditional load/store for all memory accesses.
* Replace shfl_down with shfl_sync in parallel sums for HIP in bnp
  This commit separates the HIP code from the CUDA code in parallel sums.
* Remove -U__HIP_NO_HALF_CONVERSIONS__ for HIP in bnp
  Since the built-in function is removed, -U__HIP_NO_HALF_CONVERSIONS__ is no longer needed.
* Preserve CUDA's ReLU condition path for USE_ADD_RELU in bnp
* Add test for bnp
  The test evaluates correctness of batch norm, batch norm + ReLU, and batch norm + add + ReLU against reference implementations. The forward activation output is validated against PyTorch's implementation and must be allclose with it for the test to pass. The backward gradient output is validated against a Python implementation. Due to floating point rounding error in the batch norm implementation, the group batch norm gradient output might not be allclose with the Python implementation's output when ReLU is used, even though the majority of the elements are very close to each other; we therefore use a norm-difference threshold instead of allclose to decide whether the test passes.
* Use the warp size variable rather than hard coding the warp size in bnp
  Use C10_WARP_SIZE from c10/macros/Macros.h in host functions and warpSize in device kernels instead of hard coding the warp size.
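As noted above, HIP's __ballot returns a 64-bit mask (one bit per wavefront lane), so the bitmask storage type and offsets differ from CUDA's 32-bit __ballot_sync. The sketch below shows the shape of the fix; the names (ballot_t, relu_ballot, relu_with_mask) are hypothetical, not the bnp kernel code.

```cuda
#include <cstdint>

// Hypothetical helper: one ballot word per warp/wavefront. On ROCm the
// ballot result is 64 bits wide, so storage size and offsets change.
#ifdef USE_ROCM
using ballot_t = uint64_t;   // 64 lanes per wavefront
__device__ ballot_t relu_ballot(bool p) { return __ballot(p); }
#else
using ballot_t = uint32_t;   // 32 lanes per warp
__device__ ballot_t relu_ballot(bool p) { return __ballot_sync(0xffffffffu, p); }
#endif

__global__ void relu_with_mask(float* x, ballot_t* mask, int n) {
    // assumes blockDim.x is a multiple of warpSize
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Rectify on <= 0, not just < 0: an FP32 value that rounds to zero in
    // FP16 must still be masked, or forward and backward passes disagree.
    bool zeroed = (x[i] <= 0.f);
    ballot_t bits = relu_ballot(zeroed);
    if (i % warpSize == 0) mask[i / warpSize] = bits;  // one word per warp
    if (zeroed) x[i] = 0.f;
}
```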
1) multihead_attn
2) xentropy
3) fused_adam and distributed_fused_adam
* add a test that extracts the extensions from setup.py and checks whether they can be imported
* moved the test outside tests/L0
* added a flag to switch aiter compilation on/off via --aiter when installing apex
* added information on building AITER during installation to the README
* replace C10_WARP_SIZE in fused rope
* replace C10_WARP_SIZE in fused softmax
* replace C10_WARP_SIZE in group batch norm
* replace C10_WARP_SIZE in multihead attention
* replace C10_WARP_SIZE in transducer
* replace C10_WARP_SIZE in xentropy
* replace C10_WARP_SIZE in sync batch normalization
* replace C10_WARP_SIZE in group batch norm
* replace warp_size in multihead attention

(a sketch of the replacement pattern follows this list)
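A hedged sketch of the pattern behind these replacements: inside device code the warpSize built-in is always correct for the target (32 on NVIDIA, 64 on AMD), so kernels can use it directly instead of PyTorch's C10_WARP_SIZE header constant; the motivation inferred from the commit titles is an assumption, and the kernel below is illustrative, not apex source.

```cuda
// Before (header constant):                 After (device built-in):
//   int lane = threadIdx.x % C10_WARP_SIZE;   int lane = threadIdx.x % warpSize;

__global__ void lane_demo(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int lane = threadIdx.x % warpSize;  // 0..31 on NVIDIA, 0..63 on AMD
    int warp = threadIdx.x / warpSize;
    // Example: only lane 0 of each warp writes its element through.
    if (lane == 0) out[warp + blockIdx.x * (blockDim.x / warpSize)] = in[i];
}
```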
…e the test_gelu pass (#269)
…ad of version.txt when creating a wheel (#278)
This commit changes it to the mathematically equivalent exp(y*log(x)) for x > 0; however, a 1-2 ULP precision loss is possible.
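A minimal sketch of the identity in device code (the function name is illustrative): since x^y = exp(y * log(x)) holds for x > 0, pow can be replaced by an exp/log round trip, at the cost of the extra rounding steps mentioned above.

```cuda
#include <cmath>

// x^y via exp(y * log(x)); valid only for x > 0 (log is undefined otherwise).
// The two roundings (logf, then expf) can cost 1-2 ULP versus a fused pow.
__device__ float pow_via_exp_log(float x, float y) {
    return expf(y * logf(x));
}
```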
…flow (#291)
* Created initial code for loading the fused_dense module dynamically instead of building it. The code uses the accelerator and op_builder modules from the DeepSpeed code.
* add apex/git_version_info_installed.py to .gitignore, as it is dynamically created by setup.py for the build process
* add code for building fused rope dynamically
* add code for building fused bias swiglu dynamically
* fix the code so that fused rope and fused softmax are not compiled in JIT mode; add csrc back to setup.py since it is not copied into the apex wheel
* load the JIT modules inside functions, which prevents them from building while the wheel is being built
* convert the syncbn module to JIT
* fix the unnecessary compile of the syncbn module during wheel building caused by imports in its Python module
* add the fused layer norm module to the JIT build
* make the focal loss module a JIT module
* make the focal loss module a JIT module
* make the xentropy module a JIT module
* make the bpn module a JIT module
* add code to build individual extensions without JIT
* clean up the flags for the modules based on apex/setup.py
* add a function to get the backward_pass_guard_args in CUDAOpBuilder and make MLP JIT-compile
* add fused weight gradient mlp to JIT compile
* move the fused_weight_gradient_mlp_cuda load inside a function so that it is not compiled during apex installation
* make fused index mul 2d JIT-compile and add an aten atomic header flag method to CUDAOpBuilder to support its JIT compile
* make fast multihead attention a JIT module; add generator_args to CUDAOpBuilder to support JIT compilation of this module
* make the transducer loss and transducer joint modules JIT modules; add an nvcc_threads_args method to CUDAOpBuilder to support them
* remove the extra installed_cuda_version method from CUDAOpBuilder
* add the apex_C module to JIT compile; add py-cpuinfo to requirements.txt as it is needed for TorchCPUOpBuilder
* make the nccl allocator a JIT-compile module; add an nccl_args method to CUDAOpBuilder to support it
* make amp_C a JIT module
* add a few uses of the amp_C JIT module
* add a few uses of the amp_C JIT module
* make fused adam a JIT module
* add a few uses of the amp_C JIT module
* fix an issue with the fused adam JIT module
* make fused lamb a JIT module
* make distributed adam a JIT module
* make distributed lamb a JIT module
* add the remaining amp_C uses with the JIT loader
* add the remaining usage of the apex_C JIT module
* make the nccl p2p module JIT-compile
* make the peer memory module JIT-compile
* add code to check for the minimum NCCL version needed to compile the nccl allocator module
* allow APEX_CPP_OPS=1 and APEX_CUDA_OPS=1 as replacements for the --cpp_ext and --cuda_ext command line arguments for building specific extensions in apex; save these settings for later use
* check for the minimum torch version for the nccl allocator; a module that fails its compatibility check is removed from the installed ops list
* add build as a dependency to support wheel building
* replace is_compatible (which checks installation conditions) with is_supported, because there is an issue with loading the nccl allocator
* similar to PyTorch, create a make command the user can run to install aiter; aiter is no longer built in setup.py
* update the extension import test so that it considers JIT-compiled extensions
* clean up MultiTensorApply usages so that amp_C is not built in JIT-compile mode
* add missing modules from the DeepSpeed repo; remove extra code in setup.py; use is_compatible instead of is_supported
* change the name of the apex_C module
* change the names of the cpp and cuda build flags; remove APEX_BUILD_OPS; clean up the logic to build specific modules
* add missing files used in the cpu accelerator
* add a make clean command to delete the torch extensions installed for JIT modules; fix the cpu builder import error
* remove unused code in setup.py; fix the code to build in cpu mode
* remove unused code
* remove the accelerator package and refactor the used code into the op_builder.all_ops BuilderUtils class
* remove accelerator package usages
* revert code that was removed by mistake
* clean up the setup file; rename functions and variables to more readable names
* fix the NCCL version check so that the nccl_allocator.so file can be loaded properly. The setup() call has an argument called py_modules which copies the Python classes into the site-packages folder. The Python modules in the compatibility folder lazily load the builder classes. These files are first copied into the parent folder so that they themselves land in site-packages and the kernel can be loaded into Python; the temporary copies are then deleted.
* restore the original code for importing the extensions
* rename compatibility/scaled_masked_softmax_cuda.py; add some extra tests to the contrib test runner
* add instructions for JIT load and changes in installation options
* restructure the README
* add instructions for building the wheel
* replace TorchCPUBuilder with CPUBuilder; add a main method to the contrib test runner
* create a script to build different JIT configurations for running different tests
* add a script to run tests with different JIT builds; add instructions for running JIT builds and tests to the README; add other tests to the README
* fix issues with running the tests: improper paths, counting .so files in the apex folder
* add mad internal scripts
* remove a print statement
* remove the testing section from the README
* change the location of the result file
* remove multiple results files from models.json
* add a platform-specific description to the wheel name even if no CppExtension or CUDAExtension is built with the JIT load approach
* add ninja and wheel to the requirements to be installed
* update the release notes in the README
* exclude the compatibility folder when installing apex
* Update README.md
* Update README.md
* Update README.md
* add a modification note to the original copyright
* fix the issue with symbolic links for op_builder and csrc when the apex repo is cloned in the docker container
* assign the symbolically linked folders to a variable and then loop over the list entries
* remove unnecessary tabs

Co-authored-by: skishore <sriramkumar.kishorekumar@amd.com>
Co-authored-by: sriram <sriram.kumar@silo.ai>
* Add support for conv bias relu
* Fix compilation failure
* omit the check_cudnn_version_and_warn check (no cuDNN on ROCm)
* Flatten bias for PyTorch from 4D to 1D
* Implement fusion of Conv with ReLU with MIOpen (a conceptual epilogue sketch follows this list)
* Fix compilation issues
* Fix crash for ConvBias
* Fix merge issues
* Add support for ConvBias and ConvBiasMaskRelu
* Fix segmentation fault on bwd for ConvBias
* add code for fusing conv+bias for retinanet; add a test case for retinanet
* Fix torch warning
* Fix warnings in a unit test file as well
* add builder and loader for the fused_conv_bias_relu module

Co-authored-by: Sergey Solovyev <sergey.solovyev@amd.com>
Co-authored-by: Mikko Tukiainen <mikko.tukiainen@amd.com>
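This is not the MIOpen-backed implementation, but a minimal sketch of what conv+bias+ReLU fusion buys: the bias add and activation run as one epilogue pass over the conv output instead of two separate element-wise kernel launches. All names are illustrative.

```cuda
// Hypothetical epilogue kernel: applies per-channel bias and ReLU in one
// pass over an NHWC conv output, instead of two separate element-wise kernels.
__global__ void bias_relu_epilogue(float* out, const float* bias,
                                   int n_elems, int channels) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_elems) return;
    int c = i % channels;            // NHWC: channel is the fastest dimension
    float v = out[i] + bias[c];      // bias is 1D (per channel), as in the PR
    out[i] = v > 0.f ? v : 0.f;      // ReLU
}
```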
… of maximum depth reached (#297)
* add details of fused_conv_bias_relu to the table of modules and build flags
* solve the maximum depth error
* add code to add a loader module for a JIT module
* fix errors in the JIT module adder: use the correct file name to save code to
* fix errors in the JIT module adder: use the correct builder class name and the parameter that supplies the builder module name
* fix errors in the JIT module loader
* add a description of the JIT module script that adds a JIT loader for a JIT module with a provided builder
* add a description of the JIT module script that adds a JIT loader for a JIT module with a provided builder
* add attributes and methods to override when creating a JIT module builder
* add extra new lines
* update the JIT module script to take the builder file name and extract the module name from the builder; update missing entries in the README table for adding a new JIT module
* refine the description of converting a module to JIT
* add a description of JIT
* add a description of JIT
* add code to create a builder based on user inputs
* change the example from fused_dense to swiglu
* allow the user to skip the sources list
* change the description of the cxx and nvcc flags; add a description of the methods and fields in the initial builder code created by the script
Added release notes for version 1.11.0, including new extensions and upgrades. Updated previous release notes for clarity.
* Added GHA CI workflow
* Change target branch
* Update naming
* ci: trigger actions
* Move the file
* Set up the Python env
* Use containers
* These k8s runners don't support native containers, therefore run containers in bash
* Typo
* Fix git dubious ownership
* Git fixes
* Typo
* CMake change
* requirements.txt fix
* Clone in container
* Resolve latest PyTorch main SHA
* Rewrite from scratch
* Set rocm
* Add sanity check
* set -euxo pipefail
* typo
* Rewritten
* Fix tests
* Set a large timeout for tests
* Split the steps
* Implement discussed features
* Fix tests
* Fix tests more
* Try tests
* Removed the HIP_VISIBLE_DEVICES code
* Lock the RCCL context
* Force the CPU to wait for the GPUs, and force all GPUs to wait for each other before any is allowed to reset the memory pool
* Revert
* Resolve comments
* Housekeeping
* Run CI
* Propagate import errors
* Extension tests fix
* Apply launch bounds unconditionally
* Define USE_ROCM during JIT compilation (see the sketch below)
* Revert some changes
* Resolve comments
* Fix typo
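Both the PR title and the "Define USE_ROCM during JIT compilation" commit refer to the builder passing -DUSE_ROCM to the compiler so hipified sources can branch at compile time. A minimal sketch of how such a define is consumed, illustrative rather than the exact apex source:

```cuda
// Compiled with -DUSE_ROCM by the op builder on ROCm; without it on CUDA.
#ifdef USE_ROCM
#include <hip/hip_runtime.h>
#else
#include <cuda_runtime.h>
#endif

__global__ void platform_width(int* w) {
#ifdef USE_ROCM
    *w = 64;   // AMD wavefront width
#else
    *w = 32;   // NVIDIA warp width
#endif
}
```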
Collaborator
it feels like the target branch should be ROCm/master instead of here?
Changes brought over from ROCm#324, as these were the only non-whitespace changes in that PR.